Project 1 - Insurance Premium Default Propensity Prediction

1.1 Project Objective:

Premium paid by the customer is the major revenue source for insurance companies. Default in premium payments results in significant revenue losses and hence insurance companies would like to know upfront which type of customers would default premium payments.

The objective of this project

  • Build a model that can predict the likelihood of a customer defaulting on premium payments (Who is likely to default)
  • Identify the factors that drive higher default rate (Are there any characteristics of the customers who are likely to default?)
  • Propose a strategy for reducing default rates by using the model and other insights from the analysis (What should be done to reduce the default rates?)

Problem Statement

“Explore the data to identify customers with the propensity to default on the premiums to be paid to the Insurance company. Identify the factors/ characteristics influencing these default behavior to predict the probability of who will default, so as to help the Insurance Agents to proactively reach out these customers with the right strategy/approach to make sure they pay their due premiums”

Need of the present study and business/social opportunity

  • The current pandemic situation engulfing the world and the impact in which we all are struggling socially, psychologically & economically has taken a Physical and Mental Impact
  • One of the major impact of the COVID 19 has been the economic fallout, closure of businesses , trade , commerce and the corresponding job loss.
  • These impact especially the influx of income/salaries has affected the purchase of Life & Health insurance policy and tendency to default on the premium payment in salaried Generation X.
  • The impact of being infected by the CoVid 19 virus and the medical and hospitalization expenses involved could be very challenging at such times especially for the elderly who have comparitively weaker immunity to fight this virus. + One need to be equipped with the right medical insurance at such times and the need to have a medical insurance and also you life covered with Life Insurance is of paramount importance.
  • Defaulting on Premiums could cost one dearly when facing such medical emergencies. It becomes a social responsibility to make people aware that trying to default on premiums because of the challenging economic constrains will be a short term action which could have a long term impact on the health/wealth condition of a individual and his/her dependents. This important message of how having your insurance active by paying premiums on time could be a life saver for all concerned, needs to be driven home.

1.2 Project Notes 1: Description of the data and the initial EDA

This Project submission Part 1 will be about:

1. Description of the Data 2. Initial Exploratory Data Analysis

  • Univariate analysis (distribution and spread for every continuous attribute, distribution of the data in categories for categorical ones)
  • Bivariate analysis (relationship between different variables, correlations)

2. Assumptions

  • One assumption is that there is a high enough confidence level that the sample is representative of the population.
  • The other assumption is there were no errors in data collection and data documentation

Data Report Understand how data was collected in terms of time, frequency, methodology

  • The data has been collected traditionally by the insurance company for the
  • The data collected is up to the present times where the identified defaulters are the ones who have defaulted on their latest insurance premiums.
  • The data is of the same insurance company hence the user details, patterns and behavior towards paying the premiums have been recorded over time and presented for analysis.
  • The normal period for paying of insurance premiums are yearly/annually. But the same can be flexible and tailored as per the customers request i.e. bi-annually, monthly, etc. We have not been supplied with those details with regards to the premiums mentioned. In the event of no specific mentioning of this information, we will assume these are annual premiums to be paid (eg. health premium, auto premium, etc.)
  • In event of insurance which provide the facility of paying monthly (flexible), we should informed likewise to propose the same to some cohorts of the customers who can provided with this option to make the payments of premium more easier at their end.

P.S. If possible, we should approach the insurance company (client) to share some more details with respect to the identified information missing in the dataset to help us analyze further. Some of those details could include Gender, Type of Insurance purchased, premium frequency,

3. Exploratory Data Analysis – Step by Step Approach

  • The following steps have been undertaken when exploring the dataset:

  • Environment Set up and Data Import

  • Variable Identification

  • Univariate Analysis

  • Bi-Variate Analysis

  • Variable Transformation / Feature Creation

  • Conclusions

knitr::opts_chunk$set(error = FALSE,     # suppress errors
                      message = FALSE,   # suppress messages
                      warning = FALSE,   # suppress warnings
                      echo = FALSE,      # suppress code
                      cache = TRUE)      # enable caching

3.1 Environment Set up and Data Import

3.1.1 Install necessary Packages and Invoke Libraries

3.1.2 Set up work directory

3.1.3 Import and Read the Dataset

3.1.4 Clear Description of the attributes (Name of columns) in the data set

The dataset contains the following information about 79854 policy holders:

  1. id: Unique customer ID
  2. perc_premium_paid_by_cash_credit: What % of the premium was paid by cash payments?
  3. age_in_days: age of the customer in days
  4. Income: Income of the customer
  5. Marital Status: Married/Unmarried, Married (1), unmarried (0)
  6. Veh_owned: Number of vehicles owned (1-3)
  7. Count_3-6_months_late: Number of times premium was paid 3-6 months late
  8. Count_6-12_months_late: Number of times premium was paid 6-12 months late
  9. Count_more_than_12_months_late: Number of times premium was paid more than 12 months late
  10. Risk_score: Risk score of customer (similar to credit score)
  11. No_of_dep: Number of dependents in the family of the customer (1-4)
  12. Accommodation: Owned (1), Rented (0)
  13. no_of_premiums_paid: Number of premiums paid till date
  14. sourcing_channel: Channel through which customer was sourced
  15. residence_area_type: Residence type of the customer
  16. premium : Total premium amount paid till now
  17. default: Y variable - 0 indicates that customer has defaulted the premium and 1 indicates that customer has not defaulted the premium
  18. agegroup: Age grouped in 8 buckets for a more descriptive analysis of Age vs various other variants- (later added to the data set)

Observation:

  • The identity of the customers does not reveal the geographical or geographical characteristics.
  • There is no Gender characteristic provided for analyzing the sex of the customers involved
  • The age is displayed in “Number of days” which isn’t the normally used to calculate age. We would change this to “Number of Years” to easy analysis.
  • There is no time period that is specified for this data, hence some assumptions will need to be made.
  • Details of the Sourcing Channels mentioned could give more insight on the type of sourcing which could be considered while analysing the data and introspecting to look into for more insights.
  • When we take an overview of the Age and the Number of Premiums Paid, we can introspect if there is any relation with the more number of premiums to age or not.

3.2 Variable Identification

Structure of Dataset

VIEW PLOT: The plot structure reveals the various variables of the data frame

Check the dimension of the dataset

## [1] 79853    17

DIMENSIONS: shows Columns = 17 and Rows = 79, 853

Sanity Checks

HEAD & TAIL: A uniform similarity is observed in the data frame’s columns and rows using the “head’ &”tail" function.

Check the structure of dataset

## tibble [79,853 × 17] (S3: tbl_df/tbl/data.frame)
##  $ id                              : num [1:79853] 1 2 3 4 5 6 7 8 9 10 ...
##  $ perc_premium_paid_by_cash_credit: num [1:79853] 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
##  $ age_in_days                     : num [1:79853] 11330 30309 16069 23733 19360 ...
##  $ Income                          : num [1:79853] 90050 156080 145020 187560 103050 ...
##  $ Count_3-6_months_late           : num [1:79853] 0 0 1 0 7 0 0 0 0 0 ...
##  $ Count_6-12_months_late          : num [1:79853] 0 0 0 0 3 0 0 0 0 0 ...
##  $ Count_more_than_12_months_late  : num [1:79853] 0 0 0 0 4 0 0 0 0 0 ...
##  $ Marital Status                  : num [1:79853] 0 1 0 1 0 0 0 0 1 1 ...
##  $ Veh_Owned                       : num [1:79853] 3 3 1 1 2 1 3 3 2 3 ...
##  $ No_of_dep                       : num [1:79853] 3 1 1 1 1 4 4 2 4 3 ...
##  $ Accomodation                    : num [1:79853] 1 1 1 0 0 0 1 0 1 1 ...
##  $ risk_score                      : num [1:79853] 98.8 99.1 99.2 99.4 98.8 ...
##  $ no_of_premiums_paid             : num [1:79853] 8 3 14 13 15 4 8 4 8 8 ...
##  $ sourcing_channel                : chr [1:79853] "A" "A" "C" "A" ...
##  $ residence_area_type             : chr [1:79853] "Rural" "Urban" "Urban" "Urban" ...
##  $ premium                         : num [1:79853] 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
##  $ default                         : num [1:79853] 1 1 1 1 0 1 1 1 1 1 ...

STRUCTURE OF DATASET: There are some variables in the data set which are numerical in nature whose format needs to be changed for proper analysis.

Introducing Plot Dimensions

PLOT DIMENSIONS:

– The plot intro shows 12% of the columns are discrete in nature while 88% are continuous. This will change as the formats of some variables will be changed for analysis. – There are no missing columns or rows and no missing observations, which indicates the data is uniform and complete with no undesired discrepancies.

Get Summary of the dataset

##        id        perc_premium_paid_by_cash_credit  age_in_days   
##  Min.   :    1   Min.   :0.0000                   Min.   : 7670  
##  1st Qu.:19964   1st Qu.:0.0340                   1st Qu.:14974  
##  Median :39927   Median :0.1670                   Median :18625  
##  Mean   :39927   Mean   :0.3143                   Mean   :18847  
##  3rd Qu.:59890   3rd Qu.:0.5380                   3rd Qu.:22636  
##  Max.   :79853   Max.   :1.0000                   Max.   :37602  
##      Income         Count_3-6_months_late Count_6-12_months_late
##  Min.   :   24030   Min.   : 0.0000       Min.   : 0.00000      
##  1st Qu.:  108010   1st Qu.: 0.0000       1st Qu.: 0.00000      
##  Median :  166560   Median : 0.0000       Median : 0.00000      
##  Mean   :  208847   Mean   : 0.2484       Mean   : 0.07809      
##  3rd Qu.:  252090   3rd Qu.: 0.0000       3rd Qu.: 0.00000      
##  Max.   :90262600   Max.   :13.0000       Max.   :17.00000      
##  Count_more_than_12_months_late Marital Status     Veh_Owned    
##  Min.   : 0.00000               Min.   :0.0000   Min.   :1.000  
##  1st Qu.: 0.00000               1st Qu.:0.0000   1st Qu.:1.000  
##  Median : 0.00000               Median :0.0000   Median :2.000  
##  Mean   : 0.05994               Mean   :0.4987   Mean   :1.998  
##  3rd Qu.: 0.00000               3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :11.00000               Max.   :1.0000   Max.   :3.000  
##    No_of_dep      Accomodation      risk_score    no_of_premiums_paid
##  Min.   :1.000   Min.   :0.0000   Min.   :91.90   Min.   : 2.00      
##  1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:98.83   1st Qu.: 7.00      
##  Median :3.000   Median :1.0000   Median :99.18   Median :10.00      
##  Mean   :2.503   Mean   :0.5013   Mean   :99.07   Mean   :10.86      
##  3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:99.52   3rd Qu.:14.00      
##  Max.   :4.000   Max.   :1.0000   Max.   :99.89   Max.   :60.00      
##  sourcing_channel   residence_area_type    premium         default      
##  Length:79853       Length:79853        Min.   : 1200   Min.   :0.0000  
##  Class :character   Class :character    1st Qu.: 5400   1st Qu.:1.0000  
##  Mode  :character   Mode  :character    Median : 7500   Median :1.0000  
##                                         Mean   :10925   Mean   :0.9374  
##                                         3rd Qu.:13800   3rd Qu.:1.0000  
##                                         Max.   :60000   Max.   :1.0000
##  [1] "id"                               "perc_premium_paid_by_cash_credit"
##  [3] "age_in_days"                      "Income"                          
##  [5] "Count_3-6_months_late"            "Count_6-12_months_late"          
##  [7] "Count_more_than_12_months_late"   "Marital Status"                  
##  [9] "Veh_Owned"                        "No_of_dep"                       
## [11] "Accomodation"                     "risk_score"                      
## [13] "no_of_premiums_paid"              "sourcing_channel"                
## [15] "residence_area_type"              "premium"                         
## [17] "default"

SUMMARY OF THE DATASET:

– The summary shows variables namely Marital Status, Vehicles Owned, Number of Dependents, Accommodations & Default will need to changed to factors for a correct representation of what the data is displaying. – Age is also displayed in ’Days" which need to be changed to ’Years" – After changing the characters of the mentioned variables, we will further explore the summary in detail.

3.3 Missing value treatment

##                               id perc_premium_paid_by_cash_credit 
##                                0                                0 
##                      age_in_days                           Income 
##                                0                                0 
##            Count_3-6_months_late           Count_6-12_months_late 
##                                0                                0 
##   Count_more_than_12_months_late                   Marital Status 
##                                0                                0 
##                        Veh_Owned                        No_of_dep 
##                                0                                0 
##                     Accomodation                       risk_score 
##                                0                                0 
##              no_of_premiums_paid                 sourcing_channel 
##                                0                                0 
##              residence_area_type                          premium 
##                                0                                0 
##                          default 
##                                0

MISSING VALUE PLOT: This functions details out that there are no missing values in any of the columns (variables)

3.4 Variable Transformation - Convert “Age in Days” variable to “Age” - in years

VARIABLE TRANSFORMATION: The “Age in Days” column is used to create a new variable column of ’Age" which is represented in years.

3.5.1 Feature Creation - Adding a new variable ’agegroup" to bucket the age in groups

FEATURE CREATION: The “age” - in years is used to create a new feature variable i.e. Age Group. The age component is slotted into eight buckets from lowest to highest in ascending order. The eight age groups are as follows

  • 1 = Ages between 12 & 29
  • 2 = Ages between 30 & 39
  • 3 = Ages between 40 & 49
  • 4 = Ages between 50 & 59
  • 5 = Ages between 60 & 69
  • 6 = Ages between 70 & 79
  • 7 = Ages between 80 & 89
  • 8 = Ages between 90 & 103

3.5.2 Drop unwanted variables

DROPPING THE UNWANTED VARIABLES FROM THE DATA SET:

– “Age in days” is removed and replaced instead by “Age in Years’ –”ID" is removed as it won’t be serving any purpose in the analysis.

3.5.3 Income variable (reducing by “000”) for analysis

reducing the income variable by diving it by 1000 to have an adaptable range for analyzing

3.5.4 Feature Creation - Converting “0” & “1’ into character variables for better refrence during analysis ### Converting”0" & "1’ into character variables for better refrence during analysis

3.6.1 Changing variable as factors

CHANGING VARIABLES AS FACTORS: The variables, namely Marital Status, Vehicles Owned, Number of Dependents, Accommodations & Default are appropriately changed to factors to make them discrete observations.

3.6.2 Rechecking the changed variables

## tibble [79,853 × 17] (S3: tbl_df/tbl/data.frame)
##  $ perc_premium_paid_by_cash_credit: num [1:79853] 0.317 0 0.015 0 0.888 0.512 0 0.994 0.019 0.018 ...
##  $ Income                          : num [1:79853] 90 156 145 188 103 ...
##  $ Count_3-6_months_late           : num [1:79853] 0 0 1 0 7 0 0 0 0 0 ...
##  $ Count_6-12_months_late          : num [1:79853] 0 0 0 0 3 0 0 0 0 0 ...
##  $ Count_more_than_12_months_late  : num [1:79853] 0 0 0 0 4 0 0 0 0 0 ...
##  $ Marital Status                  : Factor w/ 2 levels "Married","Not Married": 2 1 2 1 2 2 2 2 1 1 ...
##  $ Veh_Owned                       : Factor w/ 3 levels "1","2","3": 3 3 1 1 2 1 3 3 2 3 ...
##  $ No_of_dep                       : Factor w/ 4 levels "1","2","3","4": 3 1 1 1 1 4 4 2 4 3 ...
##  $ Accomodation                    : Factor w/ 2 levels "Owned","Rented": 1 1 1 2 2 2 1 2 1 1 ...
##  $ risk_score                      : num [1:79853] 98.8 99.1 99.2 99.4 98.8 ...
##  $ no_of_premiums_paid             : num [1:79853] 8 3 14 13 15 4 8 4 8 8 ...
##  $ sourcing_channel                : Factor w/ 5 levels "A","B","C","D",..: 1 1 3 1 1 2 3 1 1 1 ...
##  $ residence_area_type             : Factor w/ 2 levels "Rural","Urban": 1 2 2 2 2 1 1 2 2 1 ...
##  $ premium                         : num [1:79853] 5400 11700 18000 13800 7500 3300 20100 3300 5400 9600 ...
##  $ default                         : Factor w/ 2 levels "Defaulted","Not Defaulted": 2 2 2 2 1 2 2 2 2 2 ...
##  $ age                             : int [1:79853] 31 82 43 64 53 45 44 39 75 81 ...
##  $ agegroup                        : num [1:79853] 2 7 3 5 4 3 3 2 6 7 ...
##  perc_premium_paid_by_cash_credit     Income         Count_3-6_months_late
##  Min.   :0.0000                   Min.   :   24.03   Min.   : 0.0000      
##  1st Qu.:0.0340                   1st Qu.:  108.01   1st Qu.: 0.0000      
##  Median :0.1670                   Median :  166.56   Median : 0.0000      
##  Mean   :0.3143                   Mean   :  208.85   Mean   : 0.2484      
##  3rd Qu.:0.5380                   3rd Qu.:  252.09   3rd Qu.: 0.0000      
##  Max.   :1.0000                   Max.   :90262.60   Max.   :13.0000      
##  Count_6-12_months_late Count_more_than_12_months_late     Marital Status 
##  Min.   : 0.00000       Min.   : 0.00000               Married    :39821  
##  1st Qu.: 0.00000       1st Qu.: 0.00000               Not Married:40032  
##  Median : 0.00000       Median : 0.00000                                  
##  Mean   : 0.07809       Mean   : 0.05994                                  
##  3rd Qu.: 0.00000       3rd Qu.: 0.00000                                  
##  Max.   :17.00000       Max.   :11.00000                                  
##  Veh_Owned No_of_dep Accomodation     risk_score    no_of_premiums_paid
##  1:26746   1:19840   Owned :40030   Min.   :91.90   Min.   : 2.00      
##  2:26520   2:19902   Rented:39823   1st Qu.:98.83   1st Qu.: 7.00      
##  3:26587   3:20215                  Median :99.18   Median :10.00      
##            4:19896                  Mean   :99.07   Mean   :10.86      
##                                     3rd Qu.:99.52   3rd Qu.:14.00      
##                                     Max.   :99.89   Max.   :60.00      
##  sourcing_channel residence_area_type    premium               default     
##  A:43134          Rural:31670         Min.   : 1200   Defaulted    : 4998  
##  B:16512          Urban:48183         1st Qu.: 5400   Not Defaulted:74855  
##  C:12039                              Median : 7500                        
##  D: 7559                              Mean   :10925                        
##  E:  609                              3rd Qu.:13800                        
##                                       Max.   :60000                        
##       age            agegroup    
##  Min.   : 20.00   Min.   :1.000  
##  1st Qu.: 40.00   1st Qu.:3.000  
##  Median : 50.00   Median :4.000  
##  Mean   : 50.91   Mean   :3.643  
##  3rd Qu.: 61.00   3rd Qu.:5.000  
##  Max.   :102.00   Max.   :8.000

RECHECKING VARIABLES:

  • The structure of the data now shows that we have variables in numerical and factor categories. Each of them will accordingly be treated and analysed to explore the dataset to understand the characteristic of each ID.

  • Among the factor variables we observe:

  • Number of vehicles owned = almost equal number of people own 1, 2 and 3 number of vehicles.

  • Number of Dependents = again, an almost similar amount of dependents in the 1, 2, 3 & 4 category.

  • Accommodations owned = not much difference in the numbers who own their own accommodation v/s those who dont.

  • Sourcing Channel= 54% of the customers are sourced from Source Channel A. while Source Channel E does not amount to even 1%.

  • Residential Area Type = 60% come from urban area

  • Premium Defaults = 6.26% of the cohort have defaulted on paying the premiums.

  • Age = The average age is in the 50s, while the range is wide. There seem to outlier with the maximum age of 102 years.

3.7 Univariant Analysis

3.7.1 Y Variable “Default” - table split

## 
##     Defaulted Not Defaulted 
##      6.259001     93.740999

DEFAULT VARIABLE: The table split shows that 6.26% of the people have defaulted in their payment of insurance premium

3.7.2 Function to draw histogram and boxplot of numerical variables using ggplot

3.7.3 Visualize properties of all categorical variables

1. Observations on Age: * There seems an almost normal distribution in the Age of the customers. * The range is spread from 21 to 92 with approximately nine outliers between 93 and 102 * The range is concentrated around 41 years to 62 years * Mean & Median, both are around 51 suggesting there aren’t many outliers influencing the mean.

2. Observations on Age Group:

* Age Group 3 sees the maximum amount of individuals, followed by Group 4 * The median falls around Group 4

3. Observation on Income:

* The income range is very widely dispersed. The mean is 208,850 where the median is 166,560 which denotes some very extreme outliers are influencing the mean. * There 3rd quartile is at 252,090 and the maximum stretches to 9,0262,600 which shows the income range is widely dispersed with some making very huge amount compared to the concentration which lie between 108,010 & 252,090

4. Observation on Premium paid in Cash:

  • The data seems right skewed (positive) with the bulk in the range of 3% to 53% around the median.
  • The mean (31%) here is more than median (16.7%) suggesting outliers which are influencing the mean.

5 Observation on late payment of Premium by 3 to 6 months:

  • Data suggest that there are very few who have paid their premiums late i.e. after 3 to 6 months.
  • Infact the median and also the 3rd quartile is in 0 range.
  • The mean is 0.24 which suggests outliers to a maximum of 13 number.
  • Almost 84% of the customers have not paid their premiums late i.e. after 3 to 6 months.
  • 11% have been late 1 time
  • 35 have been late 2 times.
  • 114 have been late more than 5 times and are major outliers here.

6.Observation on late payment of Premium by 6 to 12 months

  • Data suggest that there are very few who have paid their premiums late i.e. after 6 to 12 months.
  • Infact the median and also the 3rd quartile is in 0 range.
  • The mean is 0.07 which suggests outliers to a maximum of 17 number.
  • More than 95% of the customers have not paid their premiums late i.e. after 6 to 12 months.
  • 3.3% have been late 1 time
  • 59 have been late more than 5 times and are major outliers here.

7. Observation on late payment of Premium by more than 12 months:

  • Data suggest that there are very few who have paid their premiums later than 12 months
  • Infact the median and also the 3rd quartile is in 0 range.
  • The mean is 0.05 which suggests outliers to a maximum of 11 number.
  • More than 95% of the customers have not paid their premiums late i.e. after 6 to 12 months.
  • 3.7% have been late 1 time
  • 12 have been late more than 5 times and are major outliers here.

8. Observation on Risk score of customers:

  • Risk score which is similar to “credit score” is extremely left skewed (negative).
  • The mean and Median are around 99 which denoted there aren’t any outliers influencing the data
  • The data is majorly at the score of around 99.
  • The left skewed data is skewed between 98 & 92.

9.Observation on Premiums paid by customers:

  • Premiums paid by customers is quite positive skewed and spread out with a huge range between the minimum & maximum plus the range between 3rd quartile and 1st quartile.
  • The Median is at 7500 but the mean is at 10,925 showing the presence of huge outliers which is skewing the data to the right.
  • The range of outliers, ranges from 13800 (3rd Quartile) right upto $60,000 paid as premium.

10. Observation on the number of Premiums paid by the customers

  • The number of premiums paid by customers shows a trend similar to the premium paid by the customers.
  • The data is positive skewed and spread out between 2 & 60 numbers of premiums paid.
  • The mean & median are at 10.86 & 10 resp, showing the outliers aren’t majorly influening the data.
  • Though there are outliers which are far beyond the 3rd quartile (14) as far as 60.
  • The bulk of the nos. of premiums are between 7 & 14.

3.7.4 Setting up the aesthetics

3.7.5.1.a Plotting the Numerical variables

3.7.5.1.b Partitioning the barcharts

3.7.6.1.a Plotting the Numerical variables contd…

Marital Status: Not much difference in the 2 cohorts, with a slight edge for the unmarried.

Accommodation: Again not much difference in the ones owning their houses and renting them. A slight edge to the ones owning their houses.

Residence Area: There we see 60% of the customers coming from the urban area

3.7.6.1.b Partitioning the barcharts

Number of vehicles owned: Again equally distributed at around 33% for 1, 2 & 3 vehicles owned by the customers

Number of Dependents: The four cohorts i.e. 1, 2, 3 & 4 have similar numbers in the data. All are between 24% & 25% of the share with a slight edge for 3 dependent at 25.31%.

Sourcing Channels: of the five cohorts i.r. A,B,C,D & E the bulk of the customers at 54%vhave been sourced by Channel A. Substantial amount of customers come from Channel B (20.7%), Channel C (15%) & Channel D (9.5%)

*We can see Sourcing Channels & Residence Area are the only two verticals from where we are able to see a diversion in the customer data.

3.7.7.1 Visualize properties of all continuous variables

Age shows a almost normal distribution spread widely between 20 & 90, with the bulk between 20 & 90.

Age Groups follows the Age distribution, with highest concentration in Group 3 followed by Group 4. The median is at Group 4.

Income levels seems dispersed unevenly in the spread.

3.7.7.2 Visualize properties of all continuous variables contd…

Premiums paid sees a right skew with a sharp dip in between the rise. The concentrations is between 5400 & 13800.

The number of Premiums Paid seems to have a normal distribution with a positive skew. The concentration is between 7 & 14. Many outliers far & wide up to 60.

Risk Score sees a left skew with the concentration between 98.83 & 99.52. The tail is between 91.90 and 98.

3.7.7.3 Visualize properties of all continuous variables contd…

Premium late by 3 to 6 Months, 6 to 12 Months & more than 12 Months:

  • Looking at the three cohorts above who have been late in paying their premiums on time, its apparent that maximum numbers in all three verticals are the ones who has not been late in paying their premiums on time.

  • There seems comparatively more people who have delayed paying their premiums 1 or 2 time between 3-6 months compared to the ones who delayed their premiums between 6-12 months & beyond 12 months.

3.8 B-Variant Analysis

3.8.1 Setting up the aesthetics

3.8.2.1 Default v/s numerical variables

Age: The Defaulters are comparatively a younger cohort with the median at 46. The concentration range being between 37 & 54. The non-defaulters are at a median of 51 with a range between 41 & 62.

Age Group: The defaulters are falling in the concentration range between Group 2 & 4 with a median at 3. The non defaulters are ina a range of Group 3 & 5 with median of 4 which reflects that the concentration of defaulter comparatively fall iin a less age bracket

Income: The income levels of the defaulters show that they come from a low income category compared to the non-defaulters.

3.8.2.2 Default v/s numerical variables contd…

Delayed Premium of 3 to 6 Months: + More than 85% of the total non defaulters have never defaulted premium by 3 to 6 months. 10% defaulted once & 2.5% defaulted twice. + More than 53% of the total defaulters have never defaulted premium by 3 to 6 months. 23% defaulted once & more than 11% defaulted twice

Delayed Premium of 6 to 12 Months: + Almost 97% of the total non defaulters have never defaulted premium by 6 to 12 months. 2.5% defaulted once. + 70% of the total defaulters have never defaulted premium by 6 to 12 months. 16.5% defaulted once.

Delayed Premium of more than 12 Months: + 96.5% of the total non defaulters have never defaulted premium by more than 12 months, Approx 3% defaulted once. + More than 76% of the total defaulters have never defaulted premium by more than 12 months. 16.7% defaulted once.

3.8.2.3 Default v/s numerical variables contd…

Nos. of Premiums Paid: Both the cohorts i.e. of defaulters and non- defaulters show a similar range in the number of premiums paid. Both cohorts have a median of 10 and almost similar range.

Premium Paid:Both cohorts has the same median of 7500. The range for defaulters is comparatively smaller.

3.8.2.4. Setting up the aesthetics

3.8.2.5 “Default” vs categorical variables

Observation:

*Same kind of data distribution witnessed between defaulters and non-defaulters in the “Marital Status”, “Number of Vehicles owned”, “Number of Dependents” & “Accommodations” & “Residence Area”

3.8.3 Correlation Plot

Observations:

  • High Correlation - None of the verticals have a high correlation

  • Correlation - We can see a relative correlation between “Premium” & “Income” , “Late Payment of Premium by 3 to 6 months” & “Late Payment of Premium over 12 months”

  • Low Correlation - We see a slight correlation between “Premium paid in cash” & “Late Payment of Premium by 3 to 6 months”, “Premium paid in cash” & “Late Payment of Premium by 6 to 12 months”, “Age” & Sourcing Channel A“,”Age" & “Number of Premium Paid”

3.8.4.Density Plot

Observations:

  • The Density Plot close high density between :

  • “Late Payment of Premium by 6 to 12 months” & “Late Payment of Premium by more than 12 months” = 0.27

  • “Premium Paid in Cash” & “Late Payment of Premium by 3 to 6 months” = 0.21

  • “Premium Paid in Cash” & “Late Payment of Premium by 6 to 12 months” = 0.21

  • “Late Payment of Premium by 3 to 6 months” & “Late Payment of Premium by 6 to 12 months” = 0.20

  • “Premium” & “Nos. of Premiums” = 0.19

  • “Nos. of Premiums Paid” & “Age” = 0.18

  • “Premium Paid in Cash” & “Late Payment of Premium by more than 12 months” = 0.17

  • “Sourcing Channel” & “Premium” = 0.14

  • “Premium” & “Risk Score” = 0.13

3.8.5 Pivot Table

Observations:

  • Number of Vehicles owned is quite evenly spread across the number of dependents. No major apparent trend witnesssed.
  • Accommodations owned & rented both is rural & urban localities is approximately 50% in all categories.
  • Similar trends in the sourcing channels across all age groups i.e. highest numbers sourced from Channel A followed by Channel B. However, Group 3 (i.e. age group between) has more numbers sourced from Channel C compared to Channel B.
  • The risk score peaks between agre group 3 & 4.
  • Number of Dependents has an almost equal split between 1 & 4 dependents in the unmarried a well as married category.

Conclusion:

  • Of the given data we have 6.26% of the total customers who have defaulted their insurance premium.
  • Of these defaulters we see that around 23% of them have defaulters once before & 11% have defaulted their premium twice by 3 to 6 months.
  • Around 16.5% of the defaulters have delayed on their previous premiums once by 6 to 12 months.
  • 16.7% of the defaulters have defaulted once their premium beyond 12 months.
  • Of the Non-Defaulters, 10% have defaulted their earlier premium payment by 3 to 6 months. Around 2.55% have delayed in paying their premiums by 6 to 12 months and 2.9% have delayed beyond 12 months.
  • A huge cohort of 40% come from Rural area.
  • The marital status mix is similar in the defaulters & non-defaulters categories.
  • The Age has a mean and median at around 51. The range of the age in defaulters is between 36 years & 54 years with the median at 45. This cohort is younger compared the non-defaulters segment who are the range of 41 years to 61 years with a mean of 51 years.
  • The Income levels is comparatively lower in the defaulter category.
  • There are similarities in the Premium Paid & the Number of Premiums Paid in both categories.
  • We see some correlation (density of 0.30) between “Premium” & “Income” variables and also between “Late Payment of Premium by 3 to 6 months” & “Late Payment of Premium by more than 12 months”.
  • The cohort that needs to be analyzed further will be the :

— 15% of the defaulters earlier delaying payment on premiums by 3 months to 6 months. — 3% of the defaulters earlier delaying payment on premiums by 6 months to 12 months. — 3.5% of the defaulters earlier delaying payment on premiums by more than 12 months.